Exploring Graduate Salaries: An In-Depth Analysis and Modeling¶

Table of Contents¶

  • 1. Introduction
    • 1.1 Research Questions
    • 1.2 Objectives
  • 2. Packages
  • 3. Dataset
    • 3.1 Data Scraping
    • 3.2 Data Cleaning
  • 4. Exploratory Data Analysis
    • 4.1 Variables
      • 4.1.1 Subject: 'Subject area of degree'
      • 4.1.2 Salary: 'Salary band'
      • 4.1.3 Qualification: 'Level of qualification obtained'
      • 4.1.4 Other variables
      • 4.1.5 Work marker: Why only consider 'Paid employment is an activity'
    • 4.2 Compare Categories of Each Variable by Salary Bands
      • 4.2.1 Salary band by subject
      • 4.2.2 Salary band by country
      • 4.2.3 Salary band by provider
      • 4.2.4 Salary band by mode of study
      • 4.2.5 Salary band by level of qualification obtained
      • 4.2.6 Salary band by skill group
    • 4.3 Summary of EDA
  • 5. Modelling
    • 5.1 Classification
      • 5.1.1 Data processing for classification
      • 5.1.2 Model fitting and selection for classification
    • 5.2 Feature Importance Analysis by Random Forest
      • 5.2.1 Data processing for feature importance analysis
      • 5.2.2 Hyperparameter selection
      • 5.2.3 Random Forest feature importance
    • 5.3 Further Actions: Neural Network
  • 6. Conclusion and Discussion
  • 7. References

1. Introduction¶

As the employment environment becomes more challenging, graduate employment has remained a focus of public attention. Exploring how multiple factors affect graduates' salaries, and putting forward suggestions and strategies in related fields, is of far-reaching significance. This report draws on survey data from the Higher Education Statistics Agency on the baseline profile of UK graduates in the 2020-21 academic year. The goal is further data visualization, analysis and modelling. The survey's indicator of graduate pay covers graduates' annual earnings (before tax) in their main jobs during the census week.

The survey covers graduates in a range of subjects across England, Scotland, Northern Ireland and Wales, from undergraduate to master's level, and considers a variety of routes to degree qualifications. Institutional data are contributed by higher education institutions and colleges of further education.

Job classifications in the survey were divided into high, medium and low categories based on constructs such as "skill level" and "skill specialization". These classifications take into account factors such as training duration, necessary work experience, and the knowledge required to perform the tasks. Additionally, the survey asked whether graduates view paid work as an activity or as their most important activity.

1.1 Research Questions¶

We are strongly interested in investigating income differences between graduates with different backgrounds, geographical areas and career choices. Our goal is to gain a comprehensive understanding of the initial earnings of graduates with different characteristics. Furthermore, we aim to explore how graduates' chosen field of specialization and the degree pursued during their studies affect their subsequent earnings. The interplay between these educational and occupational factors and the types of jobs graduates subsequently choose is of particular interest in explaining the determinants of earnings early in their careers.

1.2 Objectives¶

(a) Conduct exploratory data analysis to review the dataset, employing data visualization techniques to analyze salary distributions across different categories of graduates.

(b) Use the training data set to develop a classification model with the goal of predicting salary levels, using the basic characteristics of graduates as predictive features.

(c) Investigate the importance of factors that affect graduates’ salary levels, aiming to establish an importance ranking of factors based on income levels and quantify the relative impact of various factors on graduates’ salary fluctuations.

2. Packages¶

In [ ]:
#!pip install missingno
!pip install pywaffle
In [3]:
# Web scraping part
import requests
from bs4 import BeautifulSoup
import os
import zipfile

# EDA part
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno
import networkx as nx
import plotly.graph_objects as go
from pywaffle import Waffle
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['figure.dpi'] = 140

# Modelling part
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import RandomUnderSampler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier
from sklearn.naive_bayes import BernoulliNB

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

from keras import models
from keras import layers

3. Dataset¶

3.1 Data Scraping¶

In this section we will explain the details on how our dataset was scraped.

  1. Navigate to the data.gov.uk webpage to access the Higher Education Graduate Outcomes Data, then use the BeautifulSoup library to get the HTML content of the webpage. Next, locate the download link that points to the desired ZIP file containing the data. Extract the URL from the anchor tag, obtaining the zip_url for further processing.
In [4]:
url = 'https://www.data.gov.uk/dataset/37b401c3-1689-4f3c-bac4-b6cc39cdefa7/higher-education-graduate-outcomes-data'
page=requests.get(url)
soup=BeautifulSoup(page.content,'lxml')
td = soup.find_all('td')[0]
zip_url = td.find('a')['href'] # Get the url of our desired zip file
  2. Download the ZIP file from the provided URL using a GET request and extract its contents into the 'table-30' folder, preserving the directory structure.
In [5]:
folder_name = 'table-30'
response = requests.get(zip_url)
os.makedirs(folder_name, exist_ok=True)
zip_file_path = os.path.join(folder_name, "table-30.zip")
with open(zip_file_path, "wb") as file:
    file.write(response.content) # Save the content to a local ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(folder_name) # Extract the contents of the ZIP file
  3. Obtain the current working directory, construct the path to the extracted CSV file, and read it into a Pandas DataFrame (df). Specify header=14 to skip the first 14 rows of the CSV file, which contain metadata. This DataFrame now holds the data for further exploration and analysis.
In [6]:
current_directory = os.getcwd() # Get the current working directory then join together
csv_file_path = os.path.join(current_directory,'table-30', "table-30-2020-21.csv")
# We will only explore the updated dataset
df = pd.read_csv(csv_file_path, header = 14)

3.2 Data Cleaning¶

In [7]:
df.head()
Out[7]:
Subject area of degree Country of provider Provider type Level of qualification obtained Mode of former study Skill group Work population marker Salary band Academic year Number Percent
0 01 Medicine and dentistry All All All All All Paid employment is an activity Less than £15,000 2020/21 0 0%
1 01 Medicine and dentistry All All All postgraduate All All Paid employment is an activity Less than £15,000 2020/21 0 0%
2 01 Medicine and dentistry All All All undergraduate All All Paid employment is an activity Less than £15,000 2020/21 0 0%
3 01 Medicine and dentistry All All First degree All All Paid employment is an activity Less than £15,000 2020/21 0 0%
4 01 Medicine and dentistry All All Other undergraduate All All Paid employment is an activity Less than £15,000 2020/21 0 0%
In [8]:
# Drop null values
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 371700 entries, 0 to 646348
Data columns (total 11 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   Subject area of degree           371700 non-null  object
 1   Country of provider              371700 non-null  object
 2   Provider type                    371700 non-null  object
 3   Level of qualification obtained  371700 non-null  object
 4   Mode of former study             371700 non-null  object
 5   Skill group                      371700 non-null  object
 6   Work population marker           371700 non-null  object
 7   Salary band                      371700 non-null  object
 8   Academic year                    371700 non-null  object
 9   Number                           371700 non-null  int64 
 10  Percent                          371700 non-null  object
dtypes: int64(1), object(10)
memory usage: 34.0+ MB

The 'Number' column initially contained repeated counting, mainly attributable to the 'Total' and 'All' aggregate rows. Sub-datasets were subsequently filtered based on the specific conditions below.

In [9]:
# Only Keep total
c0 = df['Subject area of degree'] == 'Total'
c1 = df['Country of provider'] == 'All'
c2 = df['Provider type'] == 'All'
c3 = df['Level of qualification obtained'] == 'All'
c4 = df['Mode of former study'] == 'All'
c5 = df['Skill group'] == 'All'
c6 = df['Work population marker'] == 'Paid employment is an activity'
c7 = df['Salary band'] == 'Total'

c0n =  ~df['Subject area of degree'].isin(['Total','Total non-science CAH level 1','Total science CAH level 1'])
c1n = df['Country of provider'] != 'All'
c2n = df['Provider type'] != 'All'
c3n = df['Level of qualification obtained'] != 'All'
c4n = df['Mode of former study'] != 'All'
c5n = df['Skill group'] != 'All'
c7n = df['Salary band'] != 'Total'

4. Exploratory Data Analysis¶

4.1 Variables¶

4.1.1 Subject: 'Subject area of degree'¶

In [10]:
df_subject = df[c0n & c1 & c2 & c3 & c4 & c5 & c6 & c7] # Only keep subjects, others are total
subject = df_subject['Subject area of degree'].unique().tolist()
y = df_subject['Number']

fig = go.Figure(go.Treemap(
    labels = subject,
    parents = ['Subject'] * len(y),
    values = y
))

fig.update_layout(title = 'Number of Survey Participants in each subject')
fig.show()

This plot illustrates the variation in the number of survey participants across the 22 subjects. The exact count for each subject can be viewed by clicking on its block.

It is evident that significant differences exist between subjects. Consequently, in the subsequent salary comparison section, percentages are considered instead of raw counts.
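Normalising counts within each subject makes groups of very different sizes comparable. A minimal sketch of this conversion, using a small hypothetical frame in place of the full dataset:

```python
import pandas as pd

# Hypothetical mini-dataset: two subjects of very different sizes
df_subject = pd.DataFrame({
    'Subject area of degree': ['01 Medicine and dentistry'] * 2 + ['16 Law'] * 2,
    'Salary band': ['Less than £15,000', '£51,000+'] * 2,
    'Number': [10, 90, 40, 60],
})

# Divide each count by its subject's total to obtain within-subject percentages
totals = df_subject.groupby('Subject area of degree')['Number'].transform('sum')
df_subject['Percent'] = df_subject['Number'] / totals

print(df_subject[['Subject area of degree', 'Salary band', 'Percent']])
```

`transform('sum')` broadcasts each group total back to the original rows, so the division aligns row by row without a merge.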

4.1.2 Salary: 'Salary band'¶

In [11]:
df_salary = df[c0 & c1 & c2 & c3 & c4 & c5 & c6 & c7n]
salary = df_salary['Salary band'].unique().tolist()
count_salary = pd.pivot_table(df_salary, values='Number',index='Salary band', aggfunc='sum')

count_salary_reset = count_salary.reset_index()
color_map = ['#20639B' for _ in range(14)]
color_map[4] =  '#ED553B' # color highlight


# initialize the figure
plt.figure(figsize=(8,8),dpi=200)
ax = plt.subplot(111, polar=True)
plt.axis('off')

# Constants = parameters controlling the plot layout:
upperLimit = 20
lowerLimit = 1
labelPadding = 20

# Compute the maximum count in the dataset
max_number = count_salary_reset['Number'].max()

# Convert each count into a bar height: a count of 0 maps to lowerLimit,
# so even the smallest bars remain visible
slope = (max_number - lowerLimit) / max_number
heights = slope * count_salary_reset.Number + lowerLimit

# Compute the width of each bar. In total we have 2*pi = 360 degrees
width = 2*np.pi / len(count_salary_reset.index)

# Compute the angle each bar is centered on:
indexes = list(range(1, len(count_salary_reset.index)+1))
angles = [element * width for element in indexes]

# Draw bars
bars = ax.bar(
    x=angles,
    height=heights,
    width=width,
    bottom=lowerLimit,
    linewidth=2,
    edgecolor="white",
    color=color_map,alpha=0.8
)

# Add labels
for bar, angle, height, label in zip(bars,angles, heights, count_salary_reset["Salary band"]):

    # Labels are rotated. Rotation must be specified in degrees :(
    rotation = np.rad2deg(angle)

    # Flip some labels upside down
    alignment = ""
    if angle >= np.pi/2 and angle < 3*np.pi/2:
        alignment = "right"
        rotation = rotation + 180
    else:
        alignment = "left"

    # Finally add the labels
    ax.text(
        x=angle,
        y=lowerLimit + bar.get_height() + labelPadding,
        s=label,
        ha=alignment, fontsize=6,fontfamily='serif',
        va='center',
        rotation=rotation,
        rotation_mode="anchor")

This circular bar plot depicts the distribution of participants among 14 salary bands, spanning from 'Less than £15,000' to '£51,000+'. The height of each bar reflects the size of that population. The visualization emphasizes that the largest group of students received salaries between £24,000 and £27,000 annually upon graduation.
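The modal band can also be read off programmatically. A minimal sketch, using a small stand-in for the `count_salary` pivot table built above (three sample bands only):

```python
import pandas as pd

# Stand-in for the count_salary pivot table built above
count_salary = pd.DataFrame(
    {'Number': [1190, 36210, 9425]},
    index=['Less than £15,000', '£24,000 - £26,999', '£51,000+'])
count_salary.index.name = 'Salary band'

# idxmax returns the index label of the largest count, i.e. the modal band
most_common_band = count_salary['Number'].idxmax()
print(most_common_band)  # £24,000 - £26,999
```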

4.1.3 Qualification: 'Level of qualification obtained'¶

The qualification levels include both undergraduate and postgraduate categories. The undergraduate category contains three classes: First degree, Other undergraduate, and Undergraduate unknown. The postgraduate category contains Postgraduate (research), Postgraduate (taught) and Postgraduate unknown. The qualification distribution plot focuses on these subclasses.

In [12]:
df_qualification = df[c0 & c1 & c2 & c3n & c4 & c5 & c6 & c7]
qualification = df_qualification['Level of qualification obtained'].unique().tolist()

count_qualification = df_qualification['Number']
count_qualification.index = qualification

fig = plt.figure(
    FigureClass=Waffle,
    rows=10,
    columns=24,
    values=count_qualification[2:9],
    colors = ('#20639B', '#ED553B', '#3CAEA3', '#F5D55C', '#845EC2'),
    title={'label': 'Qualification Distribution', 'loc': 'left','fontsize': 20},
    labels=["{}({})".format(a, b) for a, b in zip(count_qualification.index[2:9], count_qualification[2:9]) ],
    legend={'loc': 'lower left', 'bbox_to_anchor': (0, -0.3), 'ncol': 3, 'framealpha': 0, 'fontsize': 15},
    font_size=30,
    icons = 'child',
    figsize=(50, 10),
    icon_legend=True
)

The predominant group of participants possesses an undergraduate qualification at the first-degree level. Additionally, there are only 10 participants with an unknown qualification at the postgraduate level.

4.1.4 Other variables¶

These are 'Country of provider', 'Provider type', 'Mode of former study' and 'Skill group'.

In [13]:
df_sub = df[c0 & c1n & c2n & c3 & c4n & c5n & c6 & c7]
pd.pivot_table(df_sub, values='Number',index=['Country of provider','Provider type','Mode of former study','Skill group'], aggfunc='sum')
Out[13]:
Number
Country of provider Provider type Mode of former study Skill group
England Further education colleges (FECs) Full-time High skilled 1575
Low skilled 440
Medium skilled 755
Part-time High skilled 1625
Low skilled 80
Medium skilled 450
Higher education providers (HEPs) Full-time High skilled 94160
Low skilled 5680
Medium skilled 12895
Part-time High skilled 21860
Low skilled 420
Medium skilled 1605
Northern Ireland Further education colleges (FECs) Full-time High skilled 110
Low skilled 25
Medium skilled 55
Part-time High skilled 350
Low skilled 25
Medium skilled 150
Higher education providers (HEPs) Full-time High skilled 2920
Low skilled 110
Medium skilled 290
Part-time High skilled 1105
Low skilled 25
Medium skilled 100
Scotland Higher education providers (HEPs) Full-time High skilled 10615
Low skilled 620
Medium skilled 1345
Part-time High skilled 2270
Low skilled 55
Medium skilled 255
Wales Further education colleges (FECs) Part-time High skilled 35
Higher education providers (HEPs) Full-time High skilled 5550
Low skilled 455
Medium skilled 870
Part-time High skilled 1350
Low skilled 40
Medium skilled 155

From the table, it is evident that the highest number of students received full-time higher education in England and possess high skills.

4.1.5 Work marker: Why only consider 'Paid employment is an activity'¶

In [14]:
df_work_salary = df[c0 & c1 & c2 & c3 & c4 & c5 & c7n]
count_work_salary = pd.pivot_table(df_work_salary, values='Number',index='Salary band',columns='Work population marker', aggfunc='sum')
count_work_salary
Out[14]:
Work population marker Paid employment is an activity Paid employment is most important activity
Salary band
Less than £15,000 1190 1100
£15,000 - £17,999 4655 4335
£18,000 - £20,999 16615 15860
£21,000 - £23,999 22135 21315
£24,000 - £26,999 36210 35230
£27,000 - £29,999 23460 22955
£30,000 - £32,999 20105 19545
£33,000 - £35,999 12355 12060
£36,000 - £38,999 6565 6395
£39,000 - £41,999 6550 6340
£42,000 - £44,999 3600 3515
£45,000 - £47,999 4085 3975
£48,000 - £50,999 3510 3405
£51,000+ 9425 9195

Upon reviewing the pivot table of 'Salary band' and 'Work population marker', it is evident that there is no significant difference between the classes 'Paid employment is an activity' and 'Paid employment is most important activity'. Therefore, for the comparison and modelling sections, we focus exclusively on 'Paid employment is an activity', as it encompasses the other class.
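If the first marker truly encompasses the second, its count can never be smaller in any band. A sketch of that sanity check, using a stand-in built from the first three rows of the pivot table above:

```python
import pandas as pd

# Stand-in for count_work_salary, taken from the first three bands above
count_work_salary = pd.DataFrame({
    'Paid employment is an activity': [1190, 4655, 16615],
    'Paid employment is most important activity': [1100, 4335, 15860],
}, index=['Less than £15,000', '£15,000 - £17,999', '£18,000 - £20,999'])

# A superset class must have a count at least as large in every band
is_superset = (count_work_salary['Paid employment is an activity']
               >= count_work_salary['Paid employment is most important activity']).all()
print(is_superset)  # True
```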

4.2 Compare Categories of Each Variable by Salary Bands¶

4.2.1 Salary band by subject¶

In [15]:
df_subject_salary = df[c0n & c1 & c2 & c3 & c4 & c5 & c6 & c7n]
new_salary = ['Less than £21k','£21k - £30k','£30k - £39k','£39k - £48k','£48k+']
# Combine some salary bands
new_cate = {salary[0]: new_salary[0],salary[1]: new_salary[0], salary[2]: new_salary[0],
                    salary[3]: new_salary[1], salary[4]: new_salary[1], salary[5]: new_salary[1],
                    salary[6]: new_salary[2], salary[7]: new_salary[2], salary[8]:new_salary[2],
                    salary[9]: new_salary[3], salary[10]: new_salary[3], salary[11]: new_salary[3],
                    salary[12]: new_salary[4], salary[13]: new_salary[4]}

df_subject_salary['New salary band'] = df_subject_salary['Salary band'].map(new_cate)
df_subject_salary['Percent'] = df_subject_salary['Percent'].str.strip('%').astype(float)/100
y_data = list(df_subject_salary['Subject area of degree'].unique())

x_data = np.zeros((len(y_data),len(new_salary)))
for i,s in enumerate(y_data):
    a = df_subject_salary[df_subject_salary['Subject area of degree'] == s].groupby('New salary band')['Percent'].sum().reset_index()
    x_data[i, :] = a['Percent'].to_numpy()
colors = [
    "rgba(80, 150, 250, 1.0)",
    "rgba(105, 176, 250, 1.0)",
    "rgba(237, 213, 92, 0.8)",
    "rgba(255, 165, 0, 0.8)",
    "rgba(255, 140, 0, 0.8)",



]


fig = go.Figure()

for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        fig.add_trace(go.Bar(
            x=[xd[i]], y=[yd],
            orientation='h',
            marker=dict(
                color=colors[i],
                line=dict(color='rgb(248, 248, 249)', width=1)
            )
        ))

fig.update_layout(
    xaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=False,
        zeroline=False,
        domain=[0.15, 1]
    ),
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=False,
        zeroline=False,
    ),
    barmode='stack',
    paper_bgcolor='rgb(248, 248, 255)',
    plot_bgcolor='rgb(248, 248, 255)',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=False,
)

annotations = []

for yd, xd in zip(y_data, x_data):
    # labeling the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            font=dict(size=6,
                                      color='rgb(67, 67, 67)'),
                            showarrow=False, align='right'))
    # labeling the first percentage of each bar (x_axis)
    annotations.append(dict(xref='x', yref='y',
                            x=xd[0] / 2, y=yd,
                            text=str(round(xd[0], 2)),
                            font=dict(size=7,
                                      color='rgb(248, 248, 255)'),
                            showarrow=False))
    # labeling the first Likert scale (on the top)
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd[0] / 2, y=1.1,
                                text=new_salary[0],
                                font=dict(size=5,
                                          color='rgb(67, 67, 67)'),
                                showarrow=False))
    space = xd[0]
    for i in range(1, len(xd)):
            # labeling the rest of percentages for each bar (x_axis)
            annotations.append(dict(xref='x', yref='y',
                                    x=space + (xd[i]/2), y=yd,
                                    text=str(round(xd[i], 2)),
                                    font=dict(size=7,
                                              color='rgb(248, 248, 255)'),
                                    showarrow=False))
            # labeling the Likert scale
            if yd == y_data[-1]:
                annotations.append(dict(xref='x', yref='paper',
                                        x=space + (xd[i]/2), y=1.1,
                                        text=new_salary[i],
                                        font=dict(size=5,
                                                  color='rgb(67, 67, 67)'),
                                        showarrow=False))
            space += xd[i]

fig.update_layout(
    title="Salary band by subject",
    annotations=annotations)

fig.show()

It is evident that students in '01 Medicine and dentistry' tend to have the highest salaries, whereas those in '25 Design, and creative and performing arts' have comparatively lower salaries.

Considering both this plot and the popularity of the subjects, we will include these 10 subjects in the modelling part:

01 Medicine and dentistry

07 Physical science

09 Mathematical sciences

10 Engineering and technology

11 Computing

15 Social sciences

16 Law

17 Business and management

22 Education and teaching

25 Design, and creative and performing arts.
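The selection above amounts to a simple membership filter. A minimal sketch, applied here to a small stand-in frame (in the report this filter would be applied to `df`):

```python
import pandas as pd

# The ten subjects retained for modelling, labelled as listed above
selected_subjects = [
    '01 Medicine and dentistry', '07 Physical science',
    '09 Mathematical sciences', '10 Engineering and technology',
    '11 Computing', '15 Social sciences', '16 Law',
    '17 Business and management', '22 Education and teaching',
    '25 Design, and creative and performing arts',
]

# Stand-in frame: the subject outside the selection should be dropped
df_demo = pd.DataFrame({
    'Subject area of degree': ['01 Medicine and dentistry',
                               '02 Subjects allied to medicine', '16 Law'],
    'Number': [100, 50, 80],
})
df_model = df_demo[df_demo['Subject area of degree'].isin(selected_subjects)]
print(len(df_model))  # 2
```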

4.2.2 Salary band by country¶

In [16]:
df_country_salary = df[c0 & c1n & c2 & c3 & c4 & c5 & c6 & c7n]
df_country_salary = df_country_salary[['Country of provider', 'Salary band','Percent']].copy()
df_country_salary['Percent'] = df_country_salary['Percent'].str.strip('%').astype(float)/100
percent_country_salary = pd.pivot_table(df_country_salary, values='Percent', index='Salary band', columns='Country of provider', aggfunc='sum')

label = ['<£15k',
 '£15k - £18k',
 '£18k - £21k',
 '£21k - £24k',
 '£24k - £27k',
 '£27k - £30k',
 '£30k - £33k',
 '£33k - £36k',
 '£36k - £39k',
 '£39k - £42k',
 '£42k - £45k',
 '£45k - £48k',
 '£48k - £51k',
 '£51k+']

fig, ax = plt.subplots(1, 1, figsize=(20, 5))
x_axis = np.arange(len(label))

color = ['#20639B', '#ED553B', '#3CAEA3', '#F5D55C']

plt.bar(x_axis - 0.3, percent_country_salary['England'], 0.2, label = 'England',color=color[0])
plt.bar(x_axis - 0.1, percent_country_salary['Northern Ireland'], 0.2, label = 'Northern Ireland',color=color[1])
plt.bar(x_axis + 0.1, percent_country_salary['Scotland'], 0.2, label = 'Scotland',color=color[2])
plt.bar(x_axis + 0.3, percent_country_salary['Wales'], 0.2, label = 'Wales',color=color[3])


plt.xticks(x_axis, label,rotation=45,ha='right', fontsize=10)
plt.xlabel("Salary Band", fontsize=15)
plt.ylabel("Percent", fontsize=15)
plt.title("Salary band by counrty", fontsize=20)
plt.legend()
plt.show()

While the percentage distributions differ only slightly among the four countries, the participant counts for Northern Ireland, Scotland and Wales are far lower than for England. It is therefore reasonable to combine the countries together and drop the 'Country of provider' variable in the modelling part.

4.2.3 Salary band by provider¶

In [ ]:
df_prov_salary = df[c0 & c1 & c2n & c3 & c4 & c5 & c6 & c7n]
df_prov_salary = df_prov_salary[['Provider type', 'Salary band', 'Number']].copy()
count_prov_salary = pd.pivot_table(df_prov_salary, values='Number', index='Salary band', columns='Provider type', aggfunc='sum')

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
color = ['#20639B', '#ED553B']

FEC = count_prov_salary['Further education colleges (FECs)'].reset_index()
HEP = count_prov_salary['Higher education providers (HEPs)'].reset_index()



ax.plot(HEP.index, HEP['Higher education providers (HEPs)'], color=color[0], label='HEP')
ax.fill_between(HEP.index, 0, HEP['Higher education providers (HEPs)'], color=color[0], alpha=0.9)

ax.plot(FEC.index, FEC['Further education colleges (FECs)'], color=color[1], label='FEC')
ax.fill_between(FEC.index, 0, FEC['Further education colleges (FECs)'], color=color[1], alpha=0.9)

ax.yaxis.tick_right()
ax.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)


for s in ['top', 'right','bottom','left']:
    ax.spines[s].set_visible(False)

ax.grid(False)

x_axis = np.arange(len(label))
plt.xticks(x_axis, label,rotation=45,ha='right', fontsize=10)

fig.text(0.13, 0.85, 'Salary band by provider', fontsize=15, fontweight='bold', fontfamily='serif')

fig.text(0.13,0.5,"FEC", fontweight="bold", fontfamily='serif', fontsize=15, color='#ED553B')
fig.text(0.17,0.5,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.18,0.5,"HEP", fontweight="bold", fontfamily='serif', fontsize=15, color='#20639B')

ax.tick_params(axis=u'both', which=u'both',length=0)

plt.show()

Due to limited data provided by FEC, our modeling efforts will focus exclusively on Higher Education Providers (HEP).

4.2.4 Salary band by mode of study¶

In [ ]:
df_mode_salary = df[c0 & c1 & c2 & c3 & c4n & c5 & c6 & c7n]
df_mode_salary['Percent'] = df_mode_salary['Percent'].str.strip('%').astype(float)/100
df_mode_salary = df_mode_salary[['Mode of former study', 'Salary band', 'Percent']].copy()
pivot_mode_salary = pd.pivot_table(df_mode_salary, values='Percent', index='Salary band', columns='Mode of former study', aggfunc='sum')
pivot_mode_salary1 = pivot_mode_salary.reset_index()
full = pivot_mode_salary1['Full-time']*100
part = - pivot_mode_salary1['Part-time']*100

fig, ax = plt.subplots(1,1, figsize=(12, 6))
ax.bar(label, full, width=0.5, color='#20639B', alpha=0.8, label='Full time')
ax.bar(label, part, width=0.5, color='#ED553B', alpha=0.8, label='Part time')

for i in range(14):
    ax.annotate(f"{-int(part[i])}%",xy=(i, part[i]-1),va = 'center', ha='center',fontweight='light', fontfamily='serif',color='#ED553B')

for i in range(14):
    ax.annotate(f"{int(full[i])}%",
                   xy=(i, full[i]+1),
                   va = 'center', ha='center',fontweight='light', fontfamily='serif',
                   color='#20639B')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

ax.set_xticklabels(label,rotation=45, fontfamily='serif')
ax.set_yticks([])

ax.legend().set_visible(False)
fig.text(0.16, 1, 'Salary band by study mode', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.825,0.924,"Part", fontweight="bold", fontfamily='serif', fontsize=15, color='#ED553B')
fig.text(0.815,0.924,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.775,0.924,"Full", fontweight="bold", fontfamily='serif', fontsize=15, color='#20639B')

plt.show()

The salary distribution differs significantly between full-time and part-time students, with part-time students generally earning higher salaries. This observation is reasonable considering their typically longer work experience compared to full-time students.

4.2.5 Salary band by level of qualification obtained¶

In [ ]:
df_qual_salary = df[c0 & c1 & c2 & c3n & c4 & c5 & c6 & c7n]
df_qual_salary['Percent'] = df_qual_salary['Percent'].str.strip('%').astype(float)/100
df_qual_salary = df_qual_salary[['Level of qualification obtained', 'Salary band', 'Percent']].copy()
df_qual_salary = df_qual_salary[df_qual_salary['Level of qualification obtained'] != 'Postgraduate unknown']
pivot_qual_salary = pd.pivot_table(df_qual_salary, values='Percent', index='Level of qualification obtained', columns='Salary band', aggfunc='sum')

fig, ax = plt.subplots(1, 1, figsize=(12, 12))
qual = [ 'All undergraduate', 'First degree',
        'Other undergraduate','Undergraduate unknown', 'All postgraduate','Postgraduate (research)',
        'Postgraduate (taught)']

below_30_colors = ['#00274d', '#004080', '#005cbf', '#007bff', '#6fa2f5', '#a4c7f6', '#d2e4f9', '#f0f7fc']
above_30_colors = ['#f5f5f1', '#d1d1d1', '#a2a2a2', '#737373', '#525252', '#404040', '#2e2e2e', '#1f1f1f']

# Create a colormap
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("custom_cmap",below_30_colors + above_30_colors)


sns.heatmap(pivot_qual_salary.loc[qual,salary],cmap=cmap,square=True, linewidth=2.5,cbar=False,
            annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})

ax.spines['top'].set_visible(True)


fig.text(.99, .725, 'Salary proportion of qualification obtained', fontweight='bold', fontfamily='serif', fontsize=15,ha='right')

ax.set_yticklabels(ax.get_yticklabels(), fontfamily='serif', rotation = 0, fontsize=11)
ax.set_xticklabels(ax.get_xticklabels(), fontfamily='serif', rotation=90, fontsize=11)

ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(axis=u'both', which=u'both',length=0)
plt.tight_layout()
plt.show()

The lighter shade of blue represents a higher percentage, and it is noteworthy that postgraduate students tend to have higher salaries, with research postgraduate students earning the highest salaries.

4.2.6 Salary band by skill group¶

In [ ]:
df_skill_salary = df[c0 & c1 & c2 & c3 & c4 & c5n & c6 & c7n]
df_skill_salary['Percent'] = df_skill_salary['Percent'].str.strip('%').astype(float)/100
df_skill_salary = df_skill_salary[['Skill group', 'Salary band', 'Percent']].copy()
percent_skill_salary = pd.pivot_table(df_skill_salary, values='Percent', index='Salary band', columns='Skill group', aggfunc='sum')

fig, ax = plt.subplots(1, 1, figsize=(20, 5))

plt.bar(np.arange(len(percent_skill_salary.index))-0.2, height=percent_skill_salary["High skilled"], zorder=3, color='#20639B', width=0.05)
plt.scatter(np.arange(len(percent_skill_salary.index))-0.2, percent_skill_salary["High skilled"], zorder=3,s=20, color='#20639B')

plt.bar(np.arange(len(percent_skill_salary.index)), height=percent_skill_salary["Medium skilled"], zorder=3, color='#ED553B', width=0.05)
plt.scatter(np.arange(len(percent_skill_salary.index)), percent_skill_salary["Medium skilled"], zorder=3,s=20, color='#ED553B')

plt.bar(np.arange(len(percent_skill_salary.index))+0.2, height=percent_skill_salary["Low skilled"], zorder=3, color='#3CAEA3', width=0.05)
plt.scatter(np.arange(len(percent_skill_salary.index))+0.2, percent_skill_salary["Low skilled"], zorder=3,s=20, color='#3CAEA3')


label = percent_skill_salary.index  # salary band labels for the x-axis (defined here in case not set earlier)
x_axis = np.arange(len(label))
plt.xticks(x_axis, label, rotation=45, ha='right', fontsize=10)
plt.xlabel("Salary Band", fontsize=15, fontfamily='serif')
plt.ylabel("Percent", fontsize=15, fontfamily='serif')

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)

fig.text(0.16, 1, 'Salary band by skill groups', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.725,0.8,"High skilled", fontweight="bold", fontfamily='serif', fontsize=15, color='#20639B')
fig.text(0.725,0.7,"Medium skilled", fontweight="bold", fontfamily='serif', fontsize=15, color='#ED553B')
fig.text(0.725,0.6,"Low skilled", fontweight="bold", fontfamily='serif', fontsize=15, color='#3CAEA3')

plt.show()

As expected, graduates in high-skilled roles are concentrated in the higher salary bands, while the medium- and low-skilled groups dominate the lower bands.

4.3 Summary of EDA¶

| Variable | Number of categories | Comparing by salary bands | Modelling part |
| --- | --- | --- | --- |
| Subject area of degree | 22 | Highest: 01 Medicine; lowest: 25 Design | Choose 10 subjects |
| Country of provider | 4 | Similar percentages across salary bands | Combine them together **[Delete]** |
| Provider type | 2 | Not enough data for FECs | Choose HEPs |
| Level of qualification obtained | 5 | Highest: **Postgraduate (research)** | Only 4 categories left after choosing HEPs |
| Mode of former study | 2 | **Part-time** tends to earn more than full-time | Keep both |
| Skill group | 3 | Highest: **High skilled** | Keep all three |
| Work population marker | 2 | One contains another | Choose 'Paid employment is an activity' |
| Salary band | 14 | — | **Needs further processing with 'Number'** |
| Academic year | 1 | '2020/21' | **[Delete]** |

5. Modelling¶

5.1 Classification¶

5.1.1 Data processing for classification¶

In [ ]:
selected_data = df[df['Subject area of degree'].isin(['01 Medicine and dentistry', '07 Physical sciences',
                                                      '09 Mathematical sciences', '10 Engineering and technology',
                                                      '11 Computing', '15 Social sciences', '16 Law',
                                                      '17 Business and management', '22 Education and teaching',
                                                      '25 Design, and creative and performing arts'])
                   & (df['Country of provider'] == 'All')
                   & (~df['Level of qualification obtained'].isin(['All undergraduate', 'All postgraduate', 'All']))
                   & (df['Provider type'] == 'Higher education providers (HEPs)')
                   & (df['Mode of former study'] != 'All')
                   & (df['Skill group'] != 'All')
                   & (df['Work population marker'] == 'Paid employment is an activity')
                   & (df['Salary band'] != 'Total')]
columns_to_remove = ['Academic year','Country of provider','Provider type','Percent','Work population marker']
selected_data = selected_data.drop(columns=columns_to_remove)

Each row of selected_data is then expanded: the row is replicated as many times as the count in its 'Number' column, converting the aggregated counts into individual-level samples.

In [ ]:
def generate_samples(row):
    count = row['Number']
    features = [row['Subject area of degree'],row['Level of qualification obtained'],
                row['Mode of former study'], row['Skill group'], row['Salary band']]
    samples = np.repeat([features], count, axis=0)
    return pd.DataFrame(samples, columns=['Subject area of degree', 'Level of qualification obtained',
                                         'Mode of former study', 'Skill group', 'Salary band'])

class_df = pd.concat(selected_data.apply(generate_samples, axis = 1).tolist(), ignore_index = True)

Drawing on the 2020-21 census of UK graduate salaries, and taking contextual factors such as the cost of living and taxation into account, we group the salary bands into three levels: bands below £24,000 are labelled low, bands from £24,000 to £35,999 medium, and bands of £36,000 and above high. Applying this scheme to the dataset gives a coarser but more interpretable view of the distribution of graduate salaries.

In [ ]:
class_df['Salary level'] = ''  # initialise the new column
for i in class_df.index:
    salary = class_df.loc[i, "Salary band"]
    if salary in ['Less than £15,000', '£15,000 - £17,999', '£18,000 - £20,999','£21,000 - £23,999']:
        class_df.loc[i, 'Salary level'] = 'Low'
    elif salary in ['£24,000 - £26,999', '£27,000 - £29,999', '£30,000 - £32,999', '£33,000 - £35,999']:
        class_df.loc[i, 'Salary level'] = 'Med'
    else:
        class_df.loc[i, 'Salary level'] = 'High'
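The row-by-row `.loc` loop above works but scales poorly on a frame of this size. An equivalent vectorized mapping (a minimal sketch on a toy frame, reusing the report's band labels; the '£36,000 - £38,999' band is illustrative) could use `Series.map`:

```python
import pandas as pd

# Toy frame standing in for class_df; band labels follow the report's dataset.
class_df = pd.DataFrame({'Salary band': ['Less than £15,000', '£24,000 - £26,999', '£36,000 - £38,999']})

low_bands = ['Less than £15,000', '£15,000 - £17,999', '£18,000 - £20,999', '£21,000 - £23,999']
med_bands = ['£24,000 - £26,999', '£27,000 - £29,999', '£30,000 - £32,999', '£33,000 - £35,999']

def band_to_level(band):
    # Map a salary band to its coarse level; anything above the med bands is 'High'.
    if band in low_bands:
        return 'Low'
    if band in med_bands:
        return 'Med'
    return 'High'

class_df['Salary level'] = class_df['Salary band'].map(band_to_level)
print(class_df['Salary level'].tolist())  # → ['Low', 'Med', 'High']
```

`map` applies the function once per element without repeated label-based index lookups, which is typically much faster than the explicit loop on a frame of ~100,000 rows.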
In [ ]:
class_df.head()
Out[ ]:
Subject area of degree Level of qualification obtained Mode of former study Skill group Salary band Salary level
0 01 Medicine and dentistry First degree Full-time High skilled £18,000 - £20,999 Low
1 01 Medicine and dentistry First degree Full-time High skilled £18,000 - £20,999 Low
2 01 Medicine and dentistry First degree Full-time High skilled £18,000 - £20,999 Low
3 01 Medicine and dentistry First degree Full-time High skilled £18,000 - £20,999 Low
4 01 Medicine and dentistry First degree Full-time High skilled £18,000 - £20,999 Low

We use Subject area of degree, Level of qualification obtained, Mode of former study and Skill group as features, with Salary level as the target class.

5.1.2 Model fitting and selection for classification¶

We opted for a comparative analysis involving a selection of classifiers, including Logistic regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM) with Radial Basis Function (RBF) kernels, Random Forest, AdaBoost, and Naive Bayes.

In [ ]:
# Identify classifiers

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Nearest Neighbour": KNeighborsClassifier(),
    "RBF SVM": svm.SVC(kernel = 'rbf', gamma = 0.5, C = 0.1, probability = True),
    "Random Forest": RandomForestClassifier(n_estimators = 12, criterion = 'entropy'),
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": BernoulliNB()
}
In [ ]:
print(class_df['Salary level'].value_counts())
Med     55065
Low     24850
High    22755
Name: Salary level, dtype: int64

Samples in the Med salary level are almost twice as numerous as those in the high and low salary groups. Therefore, during the train-test splitting procedure, a resampling method is employed to decrease the number of instances in the majority class (Med) within the training set.

In [ ]:
# Splitting train/test data

X = class_df.iloc[:, range(4)]
X = pd.get_dummies(X, columns = ['Subject area of degree', 'Level of qualification obtained',
                                 'Mode of former study', 'Skill group'], drop_first = True)
y = class_df.iloc[:, -1]

n_samples, n_features = X.shape
n_classes = len(np.unique(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

undersampler = RandomUnderSampler(sampling_strategy = 'majority')
X_train, y_train = undersampler.fit_resample(X_train, y_train)
In [ ]:
print(y_train.value_counts())
Low     19825
High    18231
Med     18231
Name: Salary level, dtype: int64

Following the implementation of the undersampling procedure, the dataset exhibits a notable improvement in balance, with the distribution of instances across classes approaching a more equitable distribution.

In [ ]:
# Fit and predict the data

for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)

y_preds = {}
y_scores = {}
for name, classifier in classifiers.items():
    y_preds[name] = classifier.predict(X_test)
    y_scores[name] = classifier.predict_proba(X_test)
In [ ]:
# Accuracy

accuracy = {}
for name, y_pred in y_preds.items():
    accuracy[name] = accuracy_score(y_test, y_pred)
pd.DataFrame.from_dict(accuracy, orient = "index", columns = ["Accuracy"])
Out[ ]:
Accuracy
Logistic Regression 0.636846
Nearest Neighbour 0.238726
RBF SVM 0.641229
Random Forest 0.639914
AdaBoost 0.636603
Naive Bayes 0.561069

After systematically tuning the parameters of each classifier, the Random Forest and RBF SVM achieved the two highest classification accuracies on this dataset, both consistently around 64%. KNN had the lowest accuracy.

Nevertheless, accuracy alone is an unreliable metric on highly imbalanced data, so we also evaluate the models with methods better suited to this setting.
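As a toy illustration of why accuracy alone misleads when one class dominates (synthetic labels, not the report's data; `balanced_accuracy_score` is shown here only as a contrast, not as the report's chosen metric):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels: 'Med' dominates, and a trivial classifier always predicts it.
y_true = ['Med'] * 80 + ['Low'] * 10 + ['High'] * 10
y_pred = ['Med'] * 100

print(accuracy_score(y_true, y_pred))           # 0.8, despite never predicting 'Low' or 'High'
print(balanced_accuracy_score(y_true, y_pred))  # mean per-class recall: (1 + 0 + 0) / 3 ≈ 0.33
```

Per-class metrics and threshold-based curves expose this failure mode, which plain accuracy hides.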

The assessment of the classification model's performance involves the construction and analysis of the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. Comparative evaluations are conducted across various classifiers to discern variations in performance. The computation of True Positive Rate (TPR) and False Positive Rate (FPR) is imperative in the delineation of these curves.

Because this is a multi-class problem, TPR and FPR cannot be computed directly. We therefore adopt the One-vs-Rest (OvR) strategy, which transforms the multi-class problem into a series of binary problems, each classifier discriminating one class from the rest. TPR and FPR are computed for each binary classifier and then micro-averaged across classes, giving a comprehensive measure of each model's performance.

In [ ]:
# OneVsRest classifier

label_binarizer = LabelBinarizer().fit(y_train)
y_onehot_test = label_binarizer.transform(y_test)

classifiers_ovr = {}
for name, classifier in classifiers.items():
    classifiers_ovr[name] = OneVsRestClassifier(classifier)

y_score_ovrs = {}
for name, classifier in classifiers_ovr.items():
    classifier.fit(X_train, y_train)
    y_score_ovrs[name] = classifier.predict_proba(X_test)
In [ ]:
# ROC curve

for name, y_score_ovr in y_score_ovrs.items():
    fpr, tpr, _ = roc_curve(y_onehot_test.ravel(), y_score_ovr.ravel())
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], linestyle='--', color='black')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))

plt.show()

In the ROC curve plot, it can be observed that the Random Forest classifier exhibits the highest AUC, consistently hovering around 0.80. Its ROC curve is also situated closest to the upper-left corner, whereas the algorithm with the poorest classification performance is the KNN method.

In [ ]:
# PR curve

for name, y_score_ovr in y_score_ovrs.items():

    precision = dict()
    recall = dict()
    average_precision = dict()

    for i in range(len(np.unique(y))):
        precision[i], recall[i], _ = precision_recall_curve(y_onehot_test[:, i], y_score_ovr[:, i])
        average_precision[i] = average_precision_score(y_onehot_test[:, i], y_score_ovr[:, i])

    precision["micro"], recall["micro"], _ = precision_recall_curve(y_onehot_test.ravel(), y_score_ovr.ravel())
    average_precision["micro"] = average_precision_score(y_onehot_test, y_score_ovr, average="micro")

    plt.plot(recall["micro"], precision["micro"], label=f'{name} (AP = {average_precision["micro"]:.2f})')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Average Precision-Recall Curve')
plt.legend(loc='center left', bbox_to_anchor = (1, 0.5))

plt.show()

In the micro-averaged PR curves, the Random Forest and SVM curves sit closest to the upper-right corner, each reaching an average precision of about 64%, while the KNN classifier again performs worst. For imbalanced data, the PR curve is a more informative evaluation than the ROC curve, and the advantage of the Random Forest is particularly clear here.

These findings match theoretical expectations. The features in this dataset are all categorical, and Random Forest, which aggregates the votes of many decision trees, handles discrete data well and is relatively resistant to overfitting. KNN, by contrast, is sensitive to severe class imbalance, which explains its poor performance on this dataset.

In [ ]:
y_best_pred = y_preds['Random Forest']

print(classification_report(y_test, y_best_pred)) # Classification report
pd.crosstab(y_test, y_best_pred, rownames = ['Actual Salary level'], colnames = ['Predicted Salary level']) # Pseudo confusion matrix
              precision    recall  f1-score   support

        High       0.59      0.62      0.60      4524
         Low       0.56      0.61      0.58      5025
         Med       0.71      0.66      0.68     10985

    accuracy                           0.64     20534
   macro avg       0.62      0.63      0.62     20534
weighted avg       0.65      0.64      0.64     20534

Out[ ]:
Predicted Salary level High Low Med
Actual Salary level
High 2808 382 1334
Low 330 3068 1627
Med 1647 2074 7264

5.2 Feature Importance Analysis by Random Forest¶

5.2.1 Data processing for feature importance analysis¶

In this section, rather than comparing salaries within individual categories, we aim to identify the features that most strongly influence the salary range across the whole dataset. Since Random Forest proved the best model for the classification problem, we fit a Random Forest regressor, select the model with the smallest Mean Squared Error (MSE), and compare the importance scores of its features to determine which have the greatest impact on the model.

To begin, we create a new dataframe with a separate column for each salary band. As in the classification section, the categorical variables are converted into 0/1 dummy variables. A pivot table then groups rows with identical features, so each row of the result records the number of individuals in each salary band for one combination of features.

In [ ]:
new_df = selected_data.copy()
new_df = pd.get_dummies(new_df, columns=['Subject area of degree','Level of qualification obtained', 'Mode of former study', 'Skill group'], prefix='', prefix_sep='')
new_df['Salary_band_range'] = new_df['Salary band'].str.split(' - ').str[0]
# Pivot the table by grouping together workers with same features
pivot_df = pd.pivot_table(new_df, values='Number', index=['01 Medicine and dentistry',
       '07 Physical sciences', '09 Mathematical sciences',
       '10 Engineering and technology', '11 Computing', '15 Social sciences',
       '16 Law', '17 Business and management', '22 Education and teaching',
       '25 Design, and creative and performing arts', 'First degree',
       'Other undergraduate', 'Postgraduate (research)',
       'Postgraduate (taught)', 'Full-time', 'Part-time', 'High skilled',
       'Low skilled', 'Medium skilled'], columns='Salary_band_range', aggfunc='sum', fill_value=0)
pivot_df.reset_index(inplace=True) # Reset the index

pivot_df.head()
Out[ ]:
Salary_band_range 01 Medicine and dentistry 07 Physical sciences 09 Mathematical sciences 10 Engineering and technology 11 Computing 15 Social sciences 16 Law 17 Business and management 22 Education and teaching 25 Design, and creative and performing arts ... £24,000 £27,000 £30,000 £33,000 £36,000 £39,000 £42,000 £45,000 £48,000 £51,000+
0 0 0 0 0 0 0 0 0 0 1 ... 10 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 1 ... 40 15 35 15 15 15 15 10 10 30
2 0 0 0 0 0 0 0 0 0 1 ... 20 10 5 0 5 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 ... 5 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 1 ... 125 70 60 30 10 15 10 15 5 20

5 rows × 33 columns

Then we normalize the number of each salary range to ensure comparable scales for comparing the importance of our variables.

In [ ]:
Y = pivot_df.iloc[:,19:33]
X = pivot_df.iloc[:,0:19]
row_sums = Y.sum(axis=1)
row_sums[row_sums == 0] = 1 # Replace zero row sums with 1 to avoid division by zero
normalized_Y = Y.div(row_sums, axis=0)# Divide each element in the DataFrame by the corresponding row sum

5.2.2 Hyperparameter selection¶

We choose the number of estimators for the Random Forest by comparing MSE values; the model with the smallest MSE is taken as the 'best' model.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, normalized_Y, test_size=0.2, random_state=21)
n_estimators_range = range(1, 101) # Set the range of estimators
mse_values = []

for n_estimators in n_estimators_range:
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=21)
    rf_model.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(n_estimators_range, mse_values, 'ro-')
plt.xlabel('Number of Estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('MSE vs Number of Estimators in Random Forest')
plt.show()

5.2.3 Random Forest feature importance¶

In [ ]:
best_n_estimators = n_estimators_range[np.argmin(mse_values)]# Get the number of estimators of our best model
rf_model = RandomForestRegressor(n_estimators=best_n_estimators, random_state=21)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

importances = pd.DataFrame({'Columns':X_train.columns,'Feature_Importances':rf_model.feature_importances_})
importances = importances.sort_values(by='Feature_Importances',ascending=False)
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.barplot(x=importances['Feature_Importances'], y=importances['Columns'], color = '#FF5733')
sns.despine()
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest Feature Importance')
plt.show()

The importance score of each column indicates its contribution to the model's predictions; a higher score means a stronger impact. 'High skilled' is the most important feature, so membership of the high-skilled group is the strongest predictor of salary range. The second most important feature is Postgraduate (research), indicating that a research postgraduate qualification strongly influences salary. 'Part-time' in the 'Mode of former study' variable is also influential, implying that former part-time students exhibit distinctive salary patterns. These important features are consistent with those identified in the EDA section, suggesting the model captures the salary-relevant structure of the variables well.
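Impurity-based importances from Random Forests can be biased toward features that are split on frequently, so permutation importance is a common cross-check. The sketch below is self-contained on synthetic 0/1 features (named after the report's dummy columns purely for illustration), not the report's fitted rf_model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Toy binary features standing in for the dummy-coded columns; 'High skilled'
# carries most of the signal by construction.
X = pd.DataFrame(rng.integers(0, 2, size=(200, 3)),
                 columns=['High skilled', 'Part-time', 'First degree'])
y = 0.6 * X['High skilled'] + 0.1 * X['Part-time'] + rng.normal(0, 0.05, 200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score degrades.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranked = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranked)
```

If the permutation ranking broadly agrees with `feature_importances_` (here 'High skilled' tops both), the impurity-based conclusions are more trustworthy.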

5.3 Further Actions: Neural Network¶

In this section we apply a neural network to the same multi-class problem. The model stacks three dense layers, is trained with the categorical cross-entropy loss, and records accuracy after each epoch.

In [ ]:
# Sequential model
nn_model = models.Sequential()
nn_model.add(layers.Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(19,)))
nn_model.add(layers.Dense(10,activation='relu', kernel_initializer='he_normal'))
nn_model.add(layers.Dense(14, activation='softmax'))
nn_model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
history = nn_model.fit(X, normalized_Y, epochs=100, batch_size=32, validation_split=0.3, verbose = 2)
Epoch 1/100
3/3 - 1s - loss: 2.6823 - accuracy: 0.0316 - val_loss: 2.6691 - val_accuracy: 0.0714 - 1s/epoch - 390ms/step
Epoch 2/100
3/3 - 0s - loss: 2.6714 - accuracy: 0.0316 - val_loss: 2.6657 - val_accuracy: 0.0714 - 53ms/epoch - 18ms/step
Epoch 3/100
3/3 - 0s - loss: 2.6617 - accuracy: 0.0737 - val_loss: 2.6626 - val_accuracy: 0.0952 - 38ms/epoch - 13ms/step
Epoch 4/100
3/3 - 0s - loss: 2.6520 - accuracy: 0.0947 - val_loss: 2.6593 - val_accuracy: 0.0952 - 43ms/epoch - 14ms/step
Epoch 5/100
3/3 - 0s - loss: 2.6434 - accuracy: 0.1053 - val_loss: 2.6561 - val_accuracy: 0.0952 - 55ms/epoch - 18ms/step
... (output for epochs 6-97 omitted; training loss decreases steadily while validation accuracy plateaus around 0.31) ...
Epoch 98/100
3/3 - 0s - loss: 2.0915 - accuracy: 0.6211 - val_loss: 2.4011 - val_accuracy: 0.3095 - 56ms/epoch - 19ms/step
Epoch 99/100
3/3 - 0s - loss: 2.0897 - accuracy: 0.6211 - val_loss: 2.4005 - val_accuracy: 0.3095 - 54ms/epoch - 18ms/step
Epoch 100/100
3/3 - 0s - loss: 2.0881 - accuracy: 0.6211 - val_loss: 2.4004 - val_accuracy: 0.3095 - 53ms/epoch - 18ms/step
In [ ]:
# Visualise the model's performance via the training loss
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.plot(np.arange(100), history.history['loss'], 'b-', label='loss')
ax.legend()
xlab, ylab = ax.set_xlabel('Epoch'), ax.set_ylabel('Loss')
In [ ]:
# Visualise the model's performance via the training accuracy
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.plot(np.arange(100), history.history['accuracy'], 'b-', label='accuracy')
ax.legend()
xlab, ylab = ax.set_xlabel('Epoch'), ax.set_ylabel('Accuracy')
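Since `model.fit` was given validation data, `history.history` also records `val_loss` and `val_accuracy` (visible in the epoch logs above), and overlaying the two curves makes the gap between training and validation performance immediately visible. A minimal sketch, using a synthetic stand-in for the history dictionary so it runs outside the notebook:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for standalone use
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for history.history as returned by model.fit(...);
# inside the notebook, use the real `history.history` instead.
history_dict = {
    'accuracy': np.linspace(0.20, 0.62, 100),
    'val_accuracy': np.linspace(0.20, 0.31, 100),
}

fig, ax = plt.subplots(1, 1, figsize=(8, 4))
epochs = np.arange(len(history_dict['accuracy']))
ax.plot(epochs, history_dict['accuracy'], 'b-', label='training accuracy')
ax.plot(epochs, history_dict['val_accuracy'], 'r--', label='validation accuracy')
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
leg = ax.legend()
```

A widening gap between the two lines is the usual visual signature of overfitting, which is consistent with the training accuracy rising while the validation accuracy stays flat.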

From these two graphs, we can see that the training loss decreases and the training accuracy rises with each epoch. However, it is worth noting that the peak training accuracy only reaches around 0.6, while the validation accuracy plateaus near 0.31, indicating that the model is not highly accurate in predicting salary bands. This limitation stems from the dataset consisting entirely of dummy (one-hot encoded) variables and from our choice of hyperparameters and activation functions. In future work, we plan to tune these parameters to improve the neural network's performance.
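The epoch logs show the validation loss plateauing around 2.40 while the training loss keeps falling, so one inexpensive tuning step would be early stopping: Keras provides this via `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=..., restore_best_weights=True)` passed to `model.fit`. Its patience logic can be sketched in plain Python (a simplified illustration, not the callback's full implementation):

```python
def early_stop_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the 1-based epoch at which training would halt under a
    simple patience rule: stop once the validation loss has failed to
    improve by more than `min_delta` for `patience` consecutive epochs."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # new best validation loss; reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: run the full schedule

# Improvement stalls after epoch 2, so with patience=3 we stop at epoch 5.
early_stop_epoch([2.50, 2.40, 2.41, 2.42, 2.43], patience=3)
```

With `restore_best_weights=True`, the Keras callback additionally rolls the model back to the weights from the best-scoring epoch rather than the last one.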

6. Conclusion and Discussion¶

We have achieved our objectives and made some interesting findings.

In summary,

(a) The exploratory data analysis (EDA) conducted in this study compared the distribution of graduates across the categories of each variable. Furthermore, we compared the percentage of graduates across salary bands within each category, which suggested that postgraduate (research), part-time, and highly skilled graduates tend to earn higher salaries within their respective categories.

(b) The random forest classifier emerges as the most apt model for handling this particular dataset, exhibiting an average accuracy of approximately 64% and an AUC value of 0.79. In contrast to alternative classification models, it demonstrates enhanced robustness, effectively addressing the challenges inherent in predicting categories within this dataset.

(c) We compared the impact of the various characteristics on salary bands across the dataset. Notably, we identified 'highly skilled', 'postgraduate (research)', and 'part-time' as the three most important variables. These findings align with the observations made in the exploratory data analysis (EDA) section, indicating that our model performs sensibly.

In subsequent research, each salary band could be converted into a continuous variable, allowing annual income to be analysed and modelled as a continuous quantity. We also aim to strengthen our reports by exploring further data-visualization tools and identifying more suitable models.
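The proposed conversion of salary bands to a continuous variable could start from band midpoints. A minimal sketch, assuming band labels of the form '£21,000 - £23,999' (the exact label format in the HESA data may differ):

```python
import re

def band_midpoint(band):
    """Map a salary-band label to a numeric value: the midpoint for a
    closed band, or the single bound for an open-ended band such as
    '£60,000 or more'. The label format is assumed for illustration."""
    values = [int(v.replace(',', '')) for v in re.findall(r'£?([\d,]+)', band)]
    if len(values) == 2:
        return sum(values) / 2
    return float(values[0])

band_midpoint('£21,000 - £23,999')  # midpoint of a closed band
band_midpoint('£60,000 or more')    # open-ended band: lower bound
```

Open-ended top and bottom bands have no true midpoint, so any single-number stand-in (here, the stated bound) introduces some distortion that a regression model on these values would inherit.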

7. References¶

Subin An (2019). The Hitchhiker's Guide to the Kaggle. [online] kaggle.com. Available at: https://www.kaggle.com/code/subinium/the-hitchhiker-s-guide-to-the-kaggle

Josh (2021). Netflix Data Visualization. [online] kaggle.com. Available at: https://www.kaggle.com/code/joshuaswords/netflix-data-visualization

Josh (2021). Awesome HR Data Visualization & Prediction. [online] kaggle.com. Available at: https://www.kaggle.com/code/joshuaswords/awesome-hr-data-visualization-prediction

James, G., et al. (2023). An Introduction to Statistical Learning: With Applications in Python. Springer Nature. Available at: https://www.statlearning.com/